Fix GCI mounter issue #38124

jingxu97 · 2016-12-05T18:20:42Z

Two fixes in this PR

Remove break in the mounter script to make sure gc run multiple times
To make sure existed container created by mounter script gets garbage collected, we make rkt gc run multiple times
Fix unmount issue cased by GlusterFS mounter
When GlusterFS volume is mounted on GCI image cluster, the GCI mounter script will create a container and then execute mount in the container. Because mount for GlusterFS will run a glusterfs daemon, the container created by the mounter will continue be in running state and rkt garbage collection will not collect it. Because of the shared mount propagation setting, any mounts on the host will be reflected to the container which might cause a large mount table. Also, in current umount device logic, we check the mount references before issuing the operations. In this case, unmount device will fail because of the extra mount points in the container. The following gives an example of original mount path and the shared mount point created in container.

a. original mount
/var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/nfs-disk rw,relatime shared:133 - ext4 /dev/sdb rw,data=ordered
b. shared mount in container
/var/lib/rkt/pods/run/3496eadc-74d9-4349-9370-db7636f5f963/stage1/rootfs/opt/stage2/gci mounter/rootfs/var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/nfs-disk rw,relatime shared:133 - ext4 /dev/sdb rw,data=ordered

The current workaround for this issue is: since these shared mounts are irrelevant to the original mounts, they should be not considered when checking the mount references. By comparing the mount paths, the shared mount path contains the full original mount path so they should be filtered out.

Remove break in the mounter script to make sure gc run multiple times

k8s-reviewable · 2016-12-05T18:20:50Z

This change is

saad-ali · 2016-12-05T23:13:14Z

pkg/util/mount/mount.go

 	// Find all references to the device.
 	var refs []string
 	if deviceName == "" {
 		glog.Warningf("could not determine device for path: %q", mountPath)
 	} else {
 		for i := range mps {
 			if mps[i].Device == deviceName && mps[i].Path != slTarget {
-				refs = append(refs, mps[i].Path)
+				mpPrefix := getPodsDirPrefix(mps[i].Path, options.DefaultKubeletPodsDirName)
+				if mpPrefix == podsDirPrefix {


My worry with this is that it may skip checks on non-rkt mounts as well. Let's see if we can minimize this.

Changed to filter out the mount refs based on string "gci-mounter" in the path

k8s-ci-robot · 2016-12-05T23:32:37Z

Jenkins verification failed for commit 6cb0ae2. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

k8s-ci-robot · 2016-12-06T00:25:17Z

Jenkins GKE smoke e2e failed for commit a22f52e. Full PR test history.

The magic incantation to run this job again is @k8s-bot cvm gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

vishh · 2016-12-06T02:01:25Z

pkg/util/mount/mount.go

@@ -130,14 +130,24 @@ func GetMountRefs(mounter Interface, mountPath string) ([]string, error) {
 		}
 	}

+	// TODO: this is a workaround for the unmount device issue caused by gci mounter.


This is a total hack. gci_mounter is a GKE specific deployment detail. We should not be leaking that logic into upstream.

We discussed a few options, but they are all seem hacky...

vishh · 2016-12-06T02:15:16Z

cluster/gce/gci/mounter/mounter

@@ -31,7 +31,7 @@ function gc {
    # Rkt pods end up creating new copies of mounts on the host. Hence it is ideal to clean them up right away.
    attempt=0
    until [ $attempt -ge 5 ]; do
-	${RKT_BINARY} gc --grace-period=0s &> /dev/null && break


Why are you making this change?

otherwise, it only run gc once.

Why is that a problem?

when I test it, I think sometimes the container still in running state in the moment when gc is executed. Give a few try can give more chances to garbage collect the containers . I also thought it was intended to execute 5 times.

It was only intended to run multiple times if any of the previous attempts fail. Blocking for 5 seconds doesn't sound like a good idea. Is it possible to detect incomplete gc and then retry?
Also why is GC of concern?

If the container is not GC, it causes those extra mounts which makes

mount table is very big if there are many mount points, which might be a performance concern since we list the mount points during unmount operations

Without this PR's fix, unmount device will fail because of those unrelated mount references. (umount device function first check whether there are other mount references for the mount path and will fail the operation if there are some exist)

1 might be of concern. Do we have any data to back up that concern?
I don't get 2. Why is umount failing if other shared recursive mounts exist? Is the umount syscall failing or is it our own custom logic?

@jingxu97

it is our own logic to check the mount references.

Can't we fix that logic then? As long as there is no separate mount point with the same device (bind mount), we can safely unmount. Check the mount reference number in /proc/self/mountinfo to make that decision.

If a mount appears in two locations via a shared mount then /proc/self/mountinfo will explicitly indicate that. Here is an example.

$ cat /proc/self/mountinfo ... 90 80 0:66 / /tmp/original/tmpfs rw,relatime **shared:2** - tmpfs tmpfs rw 103 101 0:66 / /tmp/shared/tmpfs rw,relatime **shared:2** - tmpfs tmpfs rw

Look at the mount propagation type followed by a propagation number. If the number is the same, then that mount can be ignored. A umount from any one of those paths will result in the mount disappearing from both locations.

$ sudo umount /tmp/original/tmpfs vishnuk@vishnuk-linux:~$ cat /proc/self/mountinfo ...

The current solution used here is checking the mount path and the shared mount path created by the rkt container. If the shared mount path contains the original mount path, this will be filtered out during GetMountRef. See comments #38124 (comment)

jingxu97 · 2016-12-06T06:40:11Z

@k8s-bot cvm gke e2e test this

jingxu97 · 2016-12-06T06:56:03Z

it is our own logic to check the mount references.

…

On Mon, Dec 5, 2016 at 10:52 PM, Vish Kannan ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In cluster/gce/gci/mounter/mounter <#38124>: > @@ -31,7 +31,7 @@ function gc { # Rkt pods end up creating new copies of mounts on the host. Hence it is ideal to clean them up right away. attempt=0 until [ $attempt -ge 5 ]; do - ${RKT_BINARY} gc --grace-period=0s &> /dev/null && break 1 might be of concern. Do we have any data to back up that concern? I don't get 2. Why is umount failing if other shared recursive mounts exist? Is the umount syscall failing or is it our own custom logic? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#38124>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ASSNxeJff85Qf5NZp7QRL53kd8TJ770Zks5rFQY8gaJpZM4LEjHE> .

-- - Jing

saad-ali · 2016-12-06T07:09:39Z

Please finalize on this PR ASAP, cutoff for 1.5 is Tuesday.

vishh · 2016-12-06T07:51:45Z

I'm not comfortable with this PR because of #38124 (comment)

jingxu97 · 2016-12-06T08:53:41Z

I am wondering whether it is enough to only check the propagation type and its value. Isn't it possible that some other mount points that also have the same value? By checking the mount path strings (see the example below), the shared mount path should contain the full original mount path. So it looks like by checking the mount paths and peer group X value (shared:X) it might good enough to filter out the propagated mount refs. 351 314 8:16 / /var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/nfs-disk rw,relatime shared:133 - ext4 /dev/sdb rw,data=ordered 353 28 8:16 / /var/lib/kubelet/plugins/kubernetes.io/gce-pd/mounts/nfs-disk rw,relatime shared:133 - ext4 /dev/sdb rw,data=ordered 352 911 8:16 / /var/lib/rkt/pods/run/3496eadc-74d9-4349-9370-db7636f5f963/stage1/rootfs/opt/stage2/gci-mounter/rootfs/var/lib/kubelet/plugins/ kubernetes.io/gce-pd/mounts/nfs-disk rw,relatime shared:133 - ext4 /dev/sdb rw,data=ordered

…

On Mon, Dec 5, 2016 at 11:22 PM, Vish Kannan ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In cluster/gce/gci/mounter/mounter <#38124>: > @@ -31,7 +31,7 @@ function gc { # Rkt pods end up creating new copies of mounts on the host. Hence it is ideal to clean them up right away. attempt=0 until [ $attempt -ge 5 ]; do - ${RKT_BINARY} gc --grace-period=0s &> /dev/null && break If a mount appears in two locations via a shared mount then /proc/self/mountinfo will explicitly indicate that. Here is an example. $ cat /proc/self/mountinfo ... 90 80 0:66 / /tmp/original/tmpfs rw,relatime **shared:2** - tmpfs tmpfs rw 103 101 0:66 / /tmp/shared/tmpfs rw,relatime **shared:2** - tmpfs tmpfs rw Look at the mount propagation type followed by a propagation number. If the number is the same, then that mount can be ignored. A umount from any one of those paths will result in the mount disappearing from both locations. $ sudo umount /tmp/original/tmpfs ***@***.***:~$ cat /proc/self/mountinfo ... — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#38124>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ASSNxToh8NuuUoTS4PrS756Yaok4yvNQks5rFQ0_gaJpZM4LEjHE> .

-- - Jing

jingxu97 · 2016-12-06T19:48:41Z

Updated the PR. PTAL.

k8s-ci-robot · 2016-12-06T20:09:27Z

Jenkins unit/integration failed for commit 887cd40. Full PR test history.

The magic incantation to run this job again is @k8s-bot unit test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

this is a workaround for the unmount device issue caused by gci mounter. In GCI cluster, if gci mounter is used for mounting, the container started by mounter script will cause additional mounts created in the container. Since these mounts are irrelavant to the original mounts, they should be not considered when checking the mount references. By comparing the mount path prefix, those additional mounts can be filtered out. Plan to work on better approach to solve this issue.

saad-ali · 2016-12-06T22:22:46Z

@vishh Without this change when anyone uses GlusterFS via mounter, all other mounts will fail to unmount. That's pretty bad. We need this fixed for the 1.5 release ASAP. This is hacky and I'd like to have a cleaner solution, but it gets the job done. @jingxu97 please open an issue to revisit and fix this code with a more robust solution.

saad-ali · 2016-12-06T22:22:51Z

/lgtm

k8s-github-robot · 2016-12-06T23:37:23Z

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

k8s-github-robot · 2016-12-07T00:21:05Z

Automatic merge from submit-queue

vishh · 2016-12-07T02:52:22Z

I hope by merging this in a hurry, we did not introduce new bugs :)

saad-ali · 2016-12-07T02:56:36Z

+1 We'll let it bake overnight in master. If the world hasn't burned down, we'll cherry pick this in the AM (Wed morning PST).

vishh · 2016-12-07T03:03:15Z

It would help if Jing can add more comments to the PR explaining the logic that she introduced using some example.

…

On Wed, Dec 7, 2016 at 8:26 AM, Saad Ali ***@***.***> wrote: +1 We'll let it bake overnight in master. If the world hasn't burned down, we'll cherry pick this in the AM (Wed morning PST). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#38124 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AGvIKJCXZ5sdN_UFxUYVWyqwkPMriI1Vks5rFiB6gaJpZM4LEjHE> .

…24-upstream-release-1.5 Automatic merge from submit-queue Automated cherry pick of #38124 Cherry pick of #38124 on release-1.5. #38124: Fix GCI mounter script to run garbage collection multiple

k8s-cherrypick-bot · 2016-12-07T23:42:39Z

Commit found in the "release-1.5" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

This is a fix on top kubernetes#38124. In this fix, we move the logic to filter out shared mount references into operation_executor's UnmountDevice function to avoid this part is being used by other types volumes such as rdb, azure etc. This filter function should be only needed during unmount device for GCI image.

Automatic merge from submit-queue Fix unmountDevice issue caused by shared mount in GCI This is a fix on top #38124. In this fix, we move the logic to filter out shared mount references into operation_executor's UnmountDevice function to avoid this part is being used by other types volumes such as rdb, azure etc. This filter function should be only needed during unmount device for GCI image.

This is a fix on top kubernetes#38124. In this fix, we move the logic to filter out shared mount references into operation_executor's UnmountDevice function to avoid this part is being used by other types volumes such as rdb, azure etc. This filter function should be only needed during unmount device for GCI image.

…24-upstream-release-1.4 Automated cherry pick of #38124

Fix GCI mounter script to run garbage collection multiple times

3a1cf2d

Remove break in the mounter script to make sure gc run multiple times

jingxu97 added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 5, 2016

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 5, 2016

jingxu97 added this to the v1.5 milestone Dec 5, 2016

jingxu97 assigned saad-ali, vishh and mtaufen Dec 5, 2016

k8s-github-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. release-note-label-needed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 5, 2016

jingxu97 changed the title ~~Fix GCI mounter script to run garbage collection multiple times~~ Fix GCI mounter issue Dec 5, 2016

jingxu97 added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Dec 5, 2016

saad-ali suggested changes Dec 5, 2016

View reviewed changes

jingxu97 force-pushed the Dec/gluster branch from 6cb0ae2 to a22f52e Compare December 5, 2016 23:39

vishh reviewed Dec 6, 2016

View reviewed changes

jingxu97 force-pushed the Dec/gluster branch from a22f52e to 887cd40 Compare December 6, 2016 19:47

saad-ali added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Dec 6, 2016

saad-ali approved these changes Dec 6, 2016

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 6, 2016

saad-ali removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Dec 6, 2016

k8s-github-robot merged commit 65ed735 into master Dec 7, 2016

saad-ali added the cherrypick-candidate label Dec 7, 2016

thockin deleted the Dec/gluster branch December 7, 2016 06:47

jingxu97 mentioned this pull request Dec 7, 2016

Automated cherry pick of #38124 #38328

Merged

saad-ali added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Dec 7, 2016

k8s-cherrypick-bot removed the cherrypick-candidate label Dec 7, 2016

jingxu97 mentioned this pull request Dec 8, 2016

Node E2E: Broken Pipe during test on GCI. #37333

Closed

This was referenced Dec 8, 2016

Fix unmountDevice issue caused by shared mount in GCI #38411

Merged

Automated cherry pick of #38124 #38416

Merged

jessfraz added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Dec 8, 2016

jessfraz added a commit that referenced this pull request Dec 9, 2016

Merge pull request #38416 from jingxu97/automated-cherry-pick-of-#381…

3327a41

…24-upstream-release-1.4 Automated cherry pick of #38124

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix GCI mounter issue #38124

Fix GCI mounter issue #38124

jingxu97 commented Dec 5, 2016 •

edited

k8s-reviewable commented Dec 5, 2016

saad-ali Dec 5, 2016

jingxu97 Dec 5, 2016

k8s-ci-robot commented Dec 5, 2016

k8s-ci-robot commented Dec 6, 2016

vishh Dec 6, 2016

jingxu97 Dec 6, 2016

vishh Dec 6, 2016

jingxu97 Dec 6, 2016

vishh Dec 6, 2016

jingxu97 Dec 6, 2016

vishh Dec 6, 2016

jingxu97 Dec 6, 2016

vishh Dec 6, 2016

vishh Dec 6, 2016

vishh Dec 6, 2016

jingxu97 Dec 6, 2016

jingxu97 commented Dec 6, 2016

jingxu97 commented Dec 6, 2016 via email

saad-ali commented Dec 6, 2016

vishh commented Dec 6, 2016

jingxu97 commented Dec 6, 2016 via email

jingxu97 commented Dec 6, 2016

k8s-ci-robot commented Dec 6, 2016

saad-ali commented Dec 6, 2016

saad-ali commented Dec 6, 2016

k8s-github-robot commented Dec 6, 2016

k8s-github-robot commented Dec 7, 2016

vishh commented Dec 7, 2016

saad-ali commented Dec 7, 2016

vishh commented Dec 7, 2016 via email

k8s-cherrypick-bot commented Dec 7, 2016

Fix GCI mounter issue #38124

Fix GCI mounter issue #38124

Conversation

jingxu97 commented Dec 5, 2016 • edited

k8s-reviewable commented Dec 5, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

k8s-ci-robot commented Dec 5, 2016

k8s-ci-robot commented Dec 6, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jingxu97 commented Dec 6, 2016

jingxu97 commented Dec 6, 2016 via email

saad-ali commented Dec 6, 2016

vishh commented Dec 6, 2016

jingxu97 commented Dec 6, 2016 via email

jingxu97 commented Dec 6, 2016

k8s-ci-robot commented Dec 6, 2016

saad-ali commented Dec 6, 2016

saad-ali commented Dec 6, 2016

k8s-github-robot commented Dec 6, 2016

k8s-github-robot commented Dec 7, 2016

vishh commented Dec 7, 2016

saad-ali commented Dec 7, 2016

vishh commented Dec 7, 2016 via email

k8s-cherrypick-bot commented Dec 7, 2016

jingxu97 commented Dec 5, 2016 •

edited